dCCA: detecting differential covariation patterns between two types of high-throughput omics data

Abstract
Motivation: The advent of multimodal omics data has provided an unprecedented opportunity to systematically investigate underlying biological mechanisms from distinct yet complementary angles. However, the joint analysis of multi-omics data remains challenging because it requires modeling interactions between multiple sets of high-throughput variables. Furthermore, these interaction patterns may vary across different clinical groups, reflecting disease-related biological processes.
Results: We propose a novel approach called Differential Canonical Correlation Analysis (dCCA) to capture differential covariation patterns between two multivariate vectors across clinical groups. Unlike classical Canonical Correlation Analysis, which maximizes the correlation between two multivariate vectors, dCCA aims to maximally recover differentially expressed multivariate-to-multivariate covariation patterns between groups. We have developed computational algorithms and a toolkit to sparsely select paired subsets of variables from two sets of multivariate variables while maximizing the differential covariation. Extensive simulation analyses demonstrate the superior performance of dCCA in selecting variables of interest and recovering differential correlations. We applied dCCA to the Pan-Kidney cohort from the Cancer Genome Atlas Program database and identified differentially expressed covariations between noncoding RNAs and gene expressions.
Availability and Implementation: The R package that implements dCCA is available at https://github.com/hwiyoungstat/dCCA.

$\deg(\cdot)$ denotes the degree function. Specifically, we define $\deg(i, \cdot) = \sum_{j=1}^{q} A_{ij}$ and $\deg(\cdot, j) = \sum_{i=1}^{p} A_{ij}$, where $A$ is a biadjacency matrix with $p$ rows and $q$ columns, and the entry $A_{ij}$ measures the correlation strength between the $i$-th variable in $X$ and the $j$-th variable in $Y$ ($i = 1, \dots, p$ and $j = 1, \dots, q$). Each variable $X_i$ or $Y_j$ can be considered a node in a bipartite graph, and $A_{ij}$ is the weight of the edge connecting them. $\deg(i, \cdot)$ is the sum of the $i$-th row of $A$, which represents the sum of correlations (edges) between $X_i$ and all variables of $Y$. Similarly, $\deg(\cdot, j)$ is the sum of the $j$-th column of $A$, which represents the sum of correlations (edges) between $Y_j$ and all variables of $X$.
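As a minimal sketch of the degree functions defined above, the following Python snippet computes the row and column sums of a small biadjacency matrix. The matrix entries and the function names `row_degrees` and `col_degrees` are illustrative assumptions and are not taken from the dCCA package.

```python
# Toy biadjacency matrix A with p = 3 rows (variables in X) and
# q = 4 columns (variables in Y). The entries are made-up correlation
# strengths used only for illustration.
A = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.0, 0.1],
    [0.1, 0.0, 0.2, 0.1],
]


def row_degrees(A):
    """deg(i, .): sum of edge weights between X_i and every variable in Y."""
    return [sum(row) for row in A]


def col_degrees(A):
    """deg(., j): sum of edge weights between Y_j and every variable in X."""
    return [sum(col) for col in zip(*A)]


deg_x = row_degrees(A)  # one value per variable in X
deg_y = col_degrees(A)  # one value per variable in Y
```

Both degree vectors add up to the same total, namely the sum of all entries of the matrix, which gives a quick sanity check on any implementation.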
is the fraction that objective function (3) aims to maximize in the screening procedure. $f_{\lambda_l}$ represents the criterion used to extract the dense blocks. $\lambda_l$ is a tuning parameter in the search space $\lambda_1, \dots, \lambda_L$ in Algorithm 2. $\lambda_k$ in (3) denotes the optimal $\lambda_l$ found by the search in Algorithm 2. $A_k$ denotes the $k$-th block biadjacency matrix that Algorithm 2 aims to extract ($\|\cdot\|_{1,1}$ is the entry-wise $L_{1,1}$ norm), and $|N_x^k|$ and $|N_y^k|$ represent the cardinalities of the node sets of $X$ and $Y$ for the $k$-th dense block, respectively. $f_{\lambda_l}$ can be computed for each block extracted by Algorithm 2 (Tsourakakis et al., 2013; Wu et al., 2022).

Estimation
The Bernoulli parameters in (4) can be estimated by maximum likelihood. Specifically, $\pi$ is estimated from the entire bipartite graph, while $\pi_1$ and $\pi_0$ are estimated from the edges within and outside the dense block, respectively. For $i \in N_x \setminus \bigcup_{j=0}^{k-1} N_x^j$ and $j \in N_y \setminus \bigcup_{j=0}^{k-1} N_y^j$, the maximum likelihood estimates of the Bernoulli parameters are the corresponding observed edge proportions.
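The estimation step can be illustrated with a minimal sketch, under the assumption that edges have already been binarized (for example by thresholding correlations) and that only a single dense block is considered, so that no previously extracted blocks need to be excluded. The function names and the toy matrix are hypothetical. The only fact relied on is that the maximum likelihood estimate of a Bernoulli parameter is the sample proportion of ones.

```python
def bernoulli_mle(edges):
    """MLE of a Bernoulli parameter: the fraction of ones among the edges."""
    edges = list(edges)
    return sum(edges) / len(edges) if edges else float("nan")


def estimate_block_parameters(B, block_rows, block_cols):
    """Estimate (pi, pi1, pi0) from a binary biadjacency matrix B.

    pi  uses every edge in the bipartite graph,
    pi1 uses only the edges inside the dense block,
    pi0 uses only the edges outside the dense block.
    """
    rows, cols = set(block_rows), set(block_cols)
    inside, outside = [], []
    for i, row in enumerate(B):
        for j, edge in enumerate(row):
            if i in rows and j in cols:
                inside.append(edge)
            else:
                outside.append(edge)
    return (
        bernoulli_mle(inside + outside),
        bernoulli_mle(inside),
        bernoulli_mle(outside),
    )


# Toy binary graph: the block formed by rows {0, 1} and columns {0, 1} is dense.
B = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
pi_hat, pi1_hat, pi0_hat = estimate_block_parameters(
    B, block_rows=[0, 1], block_cols=[0, 1]
)
```

A large gap between the within-block and outside-block estimates is what one would expect when the extracted block is genuinely denser than the rest of the graph.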

H. Lee et al.
Two-step dCCA Approach for Multiple Subgroups (K > 2)
In this section, we present a potential solution for applying dCCA to general cases involving $K > 2$ groups.
Step 1: Pairwise comparisons
Perform pairwise comparisons by applying dCCA to each pair of subgroups.

$$\operatorname*{arg\,max}_{u \in \mathbb{R}^p,\ v \in \mathbb{R}^q} \operatorname{Cor}(Xu, Yv)$$

Pairwise comparison is a natural first step because the differential interaction patterns between the two types of omics data may involve different subsets of biological measures (i.e., $u$ and $v$ can differ across pairs). In the breast cancer example, the biological mechanisms (interactions between the two omics datasets) in the Basal-like subtype may differ from those in the Luminal A, Luminal B, and HER2-enriched subtypes. Conducting pairwise comparisons as the initial step is therefore a reasonable approach.
Step 2: dCCA for general K subgroups
If differences are observed, we then perform dCCA to investigate further across all $K$ subgroups. Here, we propose dCCA for general $K > 2$ subgroups. The objective function in (1) can be optimized with an optimization technique similar to the one summarized in the main manuscript. In Algorithm S1, we provide an early version of the implementation strategy for optimizing (1).
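The pairwise comparisons in Step 1 can be sketched as a loop over all unordered pairs of subgroups. In this minimal Python sketch, `compare` is a hypothetical stand-in for a single two-group dCCA fit and is not the interface of the dCCA R package. The subtype labels are those of the breast cancer example.

```python
from itertools import combinations

subtypes = ["Basal-like", "Luminal A", "Luminal B", "HER2-enriched"]


def pairwise_comparisons(groups, compare):
    """Apply `compare` to every unordered pair of groups.

    `compare` is a placeholder for one two-group dCCA fit. Each pair gets
    its own result, so the selected variables may differ across pairs.
    """
    return {(a, b): compare(a, b) for a, b in combinations(groups, 2)}


# Placeholder comparison that only records which pair was analyzed.
results = pairwise_comparisons(
    subtypes, lambda a, b: f"two-group fit for {a} vs {b}"
)
```

With K subgroups this produces K(K - 1)/2 fits, so the number of comparisons grows quadratically with the number of subgroups.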
Algorithm S1. dCCA for multi-subgroup
Additional Simulation Results
In this section, we provide additional results of the simulation studies. Specifically, the results of simulation setting 3 are summarized in Table S1.
Table S1. Simulation results (Setting 3: there is no differential pattern in the association between X and Y across the groups). We compare dCCA with the screening procedure (dCCA+Screen) to dCCA without the screening procedure (dCCA) and to three competing methods: sparse CCA (SCCA), sparse LDA (SLDA), and sparse PCA (SPCA). SCCA_Sep and SPCA_Sep denote the separate applications of SCCA and SPCA, respectively. Subscripts 0 and 1 denote the groups corresponding to Z = 0 and Z = 1, respectively.
Application of dCCA to TCGA BRCA data
In this section, we apply the dCCA method to The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA-BRCA) data (https://portal.gdc.cancer.gov/projects/TCGA-BRCA). Our analysis focuses on two specific groups: we compare the Basal-like subtype with a combined group of the Luminal A and Luminal B subtypes. We have a total […]. The scatter plots of the canonical vectors are displayed in Figure S2. As shown in the heat map (Figure S1), the association between miRNAs and genes is stronger in the Luminal subtypes (Luminal A & B). The canonical vectors from dCCA reflect this differential pattern: the slope for Luminal A & B is larger than that for Basal-like. CCA, in contrast, produces nearly identical canonical vectors for the two groups. The difference in canonical correlation between the Luminal subtypes and Basal-like is 0.4337 for dCCA (Luminal A & B: 0.7563, Basal-like: 0.3226) but only -0.0083 for CCA (Luminal A & B: 0.9800, Basal-like: 0.9883). This demonstrates that dCCA can more effectively capture the differential association patterns between these clinical groups.
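The differences reported above are simple subtractions of group-wise canonical correlations. The following minimal sketch first recomputes them from the reported values and then shows, on made-up scores, how a group-wise correlation difference could be computed from canonical variates. The helper names and the toy scores are illustrative assumptions and are not part of the dCCA package.

```python
from math import sqrt

# Group-wise canonical correlations as reported in the text.
reported = {
    "dCCA": {"Luminal A & B": 0.7563, "Basal-like": 0.3226},
    "CCA": {"Luminal A & B": 0.9800, "Basal-like": 0.9883},
}

# Difference (Luminal A & B minus Basal-like), rounded to four decimals.
differences = {
    method: round(r["Luminal A & B"] - r["Basal-like"], 4)
    for method, r in reported.items()
}


def pearson(a, b):
    """Pearson correlation between two equal-length score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlation_difference(scores_a, scores_b):
    """Correlation of (Xu, Yv) in group A minus the same correlation in group B."""
    return pearson(*scores_a) - pearson(*scores_b)


# Made-up canonical variates (Xu, Yv) for two groups.
group_a = ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9])  # strong association
group_b = ([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 2.0, 1.0])  # weak association
toy_difference = correlation_difference(group_a, group_b)
```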
Table S3. Pathway analysis results. Note: "Reactome Immunoregulatory" in the first column stands for "Reactome Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell".
Fig. S1: Left: Heat map of the correlation matrix for all pairs of miRNA and gene expression. Right: Differential association pattern between miRNA and gene expression across subtypes. Specifically, Luminal A & B exhibit a stronger association than Basal-like.